February 25, 2025
Images represent a structured input that is difficult for many machine learning methods to handle
Each colored instance is a \(3 \times H \times W\) tensor input
Location matters! Images are all about spatial context - a cat is a cat regardless of which way it is facing!
A lot of “features” per instance - a \(3 \times 32 \times 32\) image has 3072 pixel values!
The solution: Convolution Layers
Convolve the image with a filter of size \(C \times F \times F\) with learned parameters
Each filter learns some part of the image (edges, innards, etc.)
With enough filters in a layer, break an image down into its constituent parts!
CIFAR-10 Data Set:
50,000 instances of \(32 \times 32 \times 3\) RGB images
A tougher image task than digits
The CNN structure (AlexNet alike):
| Layer | In Channels | In H/W | Filters | FSize | Stride | Pad | Out Channels | Out H/W |
|---|---|---|---|---|---|---|---|---|
| Conv1 | 3 | 32 | 64 | 7 | 1 | Same | 64 | 32 |
| MaxPool1 | 64 | 32 | — | 2 | 2 | 0 | 64 | 16 |
| Conv2 | 64 | 16 | 128 | 5 | 1 | Same | 128 | 16 |
| MaxPool2 | 128 | 16 | — | 2 | 2 | 0 | 128 | 8 |
| Conv3 | 128 | 8 | 256 | 3 | 1 | Same | 256 | 8 |
| MaxPool3 | 256 | 8 | — | 2 | 2 | 0 | 256 | 4 |
| Flatten | 256 | 4 | — | — | — | — | 4096 | — |
| FNN4 | 4096 | — | — | — | — | — | 512 | — |
| FNN5 | 512 | — | — | — | — | — | 512 | — |
| FNN6 | 512 | — | — | — | — | — | 10 | — |
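As a sketch, this architecture maps onto a PyTorch `nn.Sequential` roughly as below (layer names are mine; "same" padding for a stride-1 conv with filter size \(F\) is \((F-1)/2\), and 256 channels at \(4 \times 4\) flatten to 4096 features):

```python
import torch
import torch.nn as nn

# A sketch of the AlexNet-like CNN described above (layer names are mine)
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3),    # Conv1: 3x32x32 -> 64x32x32
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                               # MaxPool1: -> 64x16x16
    nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2),  # Conv2: -> 128x16x16
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                               # MaxPool2: -> 128x8x8
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), # Conv3: -> 256x8x8
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                               # MaxPool3: -> 256x4x4
    nn.Flatten(),                                            # -> 256*4*4 = 4096
    nn.Linear(256 * 4 * 4, 512), nn.ReLU(),                  # FNN4
    nn.Linear(512, 512), nn.ReLU(),                          # FNN5
    nn.Linear(512, 10),                                      # FNN6: 10 class logits
)

x = torch.randn(8, 3, 32, 32)  # a batch of 8 CIFAR-10-sized images
print(model(x).shape)          # torch.Size([8, 10])
```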
We start with a rich image.
In the first convolution layer, we create feature maps for different low-level aspects of the images
The pooling layers reduce the dimensionality of those feature maps (too much redundancy/a lot of white space due to ReLU)
The second layer convolves within and across those feature maps, further splitting up the features
Pool/Convolve/Pool
Flatten and then do a 3 layer NN for the values!
Each successive convolution/pooling layer downsamples the feature map!
Make up for downsample with more filters!
Before we show off this structure, one more “layer” - normalization layers
Deep CNNs tend to have problems with vanishing gradients
Each filter is trying to learn a small part of the image.
The gradients get really small when the filters get specific
A solution that works (without much of a theoretical basis as to why it works) is to use batch normalization
At any step, we have a 4D tensor of feature maps:
\[\underset{\text{\# of training instances}}{N} \times \underset{\text{\# of channels}}{C} \times \underset{\text{height in px}}{H} \times \underset{\text{width in px}}{W}\]
For a single channel:
\[\underset{\text{\# of training instances}}{N} \times \underset{\text{height in px}}{H} \times \underset{\text{width in px}}{W}\]
Goal: Normalize the outputs for each channel so that they have zero mean and unit variance!
Why?
\[\mu_c = \frac{1}{N \times H \times W} \sum \limits_{i,j,k} x_{i,j,k}\]
\[\sigma^2_c = \frac{1}{N \times H \times W} \sum \limits_{i,j,k} (x_{i,j,k} - \mu_c)^2\]
\[\hat{x}_{i,j,k} = \frac{x_{i,j,k} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}\]
\[y_{i,j,k} = \gamma_c \hat{x}_{i,j,k} + \delta_c\]
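A minimal sketch of these equations, checked against PyTorch's built-in layer (with \(\gamma_c = 1\) and \(\delta_c = 0\), the defaults at initialization):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 8, 4, 4)  # an N x C x H x W tensor of feature maps

# Manual per-channel batch norm, following the equations above:
# mean and variance are taken over N, H, and W for each channel c
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

# PyTorch's built-in layer does the same thing in training mode
# (gamma is initialized to 1 and beta -- our delta_c -- to 0)
bn = nn.BatchNorm2d(8)
bn.train()
y = bn(x)

print(torch.allclose(x_hat, y, atol=1e-5))  # True
```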
A normalization layer is typically placed after a convolution layer
Requires a slightly different operation at test time
Outlined in chapter 14 of PML1
Handled natively by PyTorch
Let’s try all this out for the CIFAR10 data!
AlexNet gets us to around 75% in validation accuracy!
One thing that we might want to do is visualize what each filter corresponds to in the original images
This can be really tough when we have multiple convolutional layers!
The bottom layers correspond to little pieces of the images
Higher layers put the little pieces together into bigger pieces
Approach 1: Exemplars
At any layer, compute the hidden values for image \(i\) after using filter \(j\)
Find the images that produce the largest values at that filter!
Look at the top \(x\) images and find commonalities to see what the filter has picked up.
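A sketch of the exemplar approach, using a single untrained conv layer as a stand-in for a layer from the trained network (the layer, images, and filter index here are all hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the first conv layer of a trained network
conv1 = nn.Conv2d(3, 64, kernel_size=7, padding=3)
images = torch.randn(100, 3, 32, 32)  # stand-in for real training images
j = 5                                 # the filter we want to interpret

with torch.no_grad():
    maps = conv1(images)[:, j]        # each image's feature map at filter j
    scores = maps.amax(dim=(1, 2))    # strongest activation per image

# Indices of the top-5 exemplar images for this filter
top5 = scores.topk(5).indices
print(top5)
```

In practice you would look at those five images side by side and ask what they have in common.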
Approach 2: Activation Maximization
At the bottom of the network, input is a \(C \times H \times W\) tensor of pixel values
Start with a random set of pixels
Feed this through the trained network until we get to the hidden representations at filter \(j\)
Compute the gradient at the filter for an input image
Ascend the gradient to increase the activation value!
Repeat over and over again.
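The loop above can be sketched as plain gradient ascent on the pixels (again with an untrained conv layer standing in for the trained network, and a made-up step size):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 64, kernel_size=7, padding=3)  # stand-in for a trained layer
j = 5                                               # filter to maximize

x = torch.randn(1, 3, 32, 32, requires_grad=True)   # random starting pixels
x0 = x.detach().clone()

for _ in range(50):
    activation = conv1(x)[0, j].mean()  # how strongly filter j fires
    activation.backward()
    with torch.no_grad():
        x += 0.1 * x.grad               # ascend the gradient
        x.grad.zero_()

with torch.no_grad():
    print(conv1(x)[0, j].mean() > conv1(x0)[0, j].mean())  # tensor(True)
```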
In theory, this approach yields some visual representation of what image would activate most heavily at a specific filter.
Somewhat costly in practice
AlexNet does well on this data set
We can improve this by using a deeper architecture
CNN Architecture (VGG) Rules:
All convolutions are same 3x3 convolutions with stride 1
All max pools are 2x2 with stride 2
After pooling, double the number of filters/channels
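These rules can be sketched as a small stage-builder (a simplified take on the VGG pattern; the stage depths here are illustrative, not the published VGG configurations):

```python
import torch
import torch.nn as nn

# Build a convolutional stage following the VGG rules: same 3x3 convs with
# stride 1, then a 2x2 max pool with stride 2
def vgg_stage(in_ch, out_ch, n_convs):
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

# Channels double after each pooling stage
features = nn.Sequential(
    vgg_stage(3, 64, 2),     # 32x32 -> 16x16
    vgg_stage(64, 128, 2),   # 16x16 -> 8x8
    vgg_stage(128, 256, 2),  # 8x8  -> 4x4
)
print(features(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 256, 4, 4])
```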
Let’s go back over to the notebook to see what this looks like.
A little better, but still not state of the art!
The problem is that our networks are quickly overfitting to the data
Two dominant approaches to get better generalization for CNNs:
Dropout
Data Augmentation
As we saw before, dropout is the act of randomly turning off connections in the network with some probability.
The idea is that it forces the procedure to learn an ensemble of bad networks that work pretty well once averaged together
Reduce reliance on any one convolution or neuron
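A quick look at what the dropout layer actually does (the drop probability here is illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)  # each unit is zeroed with probability 0.5
x = torch.ones(10)

drop.train()
print(drop(x))   # roughly half the entries are 0; survivors are scaled up to 2.0

drop.eval()
print(drop(x))   # at test time dropout is a no-op: all ones
```

The surviving units are rescaled by \(1/(1-p)\) during training so that the expected activation stays the same at test time.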
Let’s apply this to our VGG model to see if it works a little better
A clever image-specific approach is data augmentation
A cat is a cat is a cat
Even if it’s flipped over
Even if it’s closer to the camera
Even if it is facing left or right
Data Augmentation supplements the training set with new images that are random variations on the original images!
Pretty easy to implement in PyTorch with Torchvision
Data Augmentation slows down the training procedure!
More iterations needed to get to the minimum because the training data is constantly changing
The benefit is that the model becomes quite robust to small changes in images and can do a pretty good job of generalizing to new pictures!
We eke out more performance.
Looking at this data, we didn’t see the improvement that we would’ve hoped to see by making the model deeper
This is a common phenomenon in CNNs
This doesn’t jibe with what we understand about DNNs
As the DNN gets deeper, we should see pretty good improvement in novel image problems
In fact, there is a common phenomenon where really deep CNNs actually perform worse than shallow CNNs
This shouldn’t happen: a shallow network with an extra layer should, at worst, do as well as the shallow model (the new layer can simply learn the identity mapping)!
The problem is one of optimization rather than theoretical!
A novel solution to this problem is the residual network (He et al, 2016)
The basis of the ResNet is the residual block
We’re adding the new info learned via the convolution layers within the block to the original input
This works because it is asking each residual block to learn some small part of the overall mapping of the original features to the outcome
With enough little parts, we’re able to complete the mapping!
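A minimal residual block in this style (a simplified sketch after He et al., 2016, for the case where the input and output shapes match):

```python
import torch
import torch.nn as nn

# A basic residual block: two 3x3 convs with batch norm, plus the identity
# "skip" connection added back in before the final ReLU
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # F(x) + x: new info added to the input

block = ResidualBlock(64)
x = torch.randn(2, 64, 16, 16)
print(block(x).shape)  # torch.Size([2, 64, 16, 16]) -- same shape in and out
```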
This architecture makes the gradients much less likely to disappear
The gradients flow directly from the output to the earlier layers
Since each block is learning its own little bit of the mapping of \(\mathbf x\) to \(y\), the gradient for each residual block doesn’t depend on the depth
In other words, the gradients in each layer aren’t stacking and multiplying - we’re just adding!
A Vanilla CNN:
\[ x_k = f_k \circ f_{k-1} \circ \cdots \circ f_1(x) \]
After chain rule:
\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial x_k} \cdot \prod_{i=1}^{k} \left( \phi_i'(z_i) \, W_i \right) \]
If the spectral norm of each \(\phi_i'(z_i) \, W_i\) is less than 1, then as \(k\) increases the product can shrink exponentially. This leads to the vanishing gradient problem.
ResNets:
\[ H(x) = F(x) + x \]
The gradient:
\[ \frac{\partial H(x)}{\partial x} = \frac{\partial \left( F(x) + x \right)}{\partial x} = \frac{\partial F(x)}{\partial x} + \mathcal I \]
After chain rule:
\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H(x)} \cdot \left( \frac{\partial F(x)}{\partial x} + \mathcal I \right) \]
With many layers:
\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial x_k} \cdot \prod_{i=1}^{k} \left( \mathcal I + J_{F_i}(x_i) \right) \]
Even if \(J_{F_i}(x_i)\) has eigenvalues less than 1, the identity matrix keeps each term \(\mathcal I + J_{F_i}(x_i)\) close to the identity, so the product doesn’t shrink to zero.
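A quick numerical illustration of the two products above, with random Jacobians scaled to spectral norm 0.9 (the dimension, depth, and scale here are arbitrary choices for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random Jacobian rescaled so its spectral norm is exactly 0.9 (< 1)
def random_jacobian(d=16, norm=0.9):
    J = rng.standard_normal((d, d))
    return norm * J / np.linalg.norm(J, ord=2)

plain, residual = np.eye(16), np.eye(16)
for _ in range(30):  # 30 "layers"
    J = random_jacobian()
    plain = plain @ J                       # vanilla chain: product of Jacobians
    residual = residual @ (np.eye(16) + J)  # residual chain: I + J terms

print(np.linalg.norm(plain, ord=2))     # tiny: at most 0.9**30, about 0.04
print(np.linalg.norm(residual, ord=2))  # typically stays on the order of 1+
```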
Where have we seen this idea before?
Architecture:
Aggressive stem with a \(f \times f\) filter and pooling stage that cuts the initial image size by 75%; 64 filters
3 residual blocks with 64 filters
3 residual blocks with 128 filters
3 residual blocks with 256 filters
3 residual blocks with 512 filters
Global average pooling and a single multinomial layer
Resnets also leverage some concepts popularized by Google in 2015 with GoogLeNet (an homage to LeNet)
No fully connected hidden layers at the top of the network!
Instead, the values in the feature map for each channel are averaged after the last convolution layer - this is called global average pooling
The idea is that the series of residual blocks has made the feature maps at the top layer so sparse that each image, more or less, corresponds to a few of the feature maps. All we need to know is which ones in order to predict what it is!
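A sketch of this classification head (the 512-channel, \(4 \times 4\) feature-map shape matches the architecture above):

```python
import torch
import torch.nn as nn

# Global average pooling: average each channel's feature map down to a single
# number, then classify from those channel averages -- no big FC hidden layers
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # 512 x 4 x 4 -> 512 x 1 x 1
    nn.Flatten(),             # -> 512
    nn.Linear(512, 10),       # the single multinomial (softmax) layer
)

feature_maps = torch.randn(8, 512, 4, 4)  # output of the last residual block
print(head(feature_maps).shape)           # torch.Size([8, 10])
```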
This architecture, even without the residual connections, more or less matches the performance of VGG.
Any guess as to why?
Why use residual connections in the first place?
Even with my sick GPU, I can’t really train a model with 50 layers!
Where do these architectural choices come from?
The ImageNet Challenge!
ImageNet is a large database of more than 14 million hand-annotated images
What objects are in the images?
For multi-object images, what is the main object?
More than 20,000 categories!
ImageNet Challenge:
For a standardized subset of ImageNet, predict the top-5 categories (in terms of softmax) for each image
If the true label for a test image is in the top 5, success!
This is a really hard problem!
The training set is really big - millions of images
The challenge set is 1,000 classes
Modern CNN architectures require serious compute power!
These models perform really well on this tough challenge!
You could replicate these results for the ImageNet challenge yourself!
These architectures really push the limits of what computers can do
ImageNet is a broad database with lots of different tagged images
Complicated models with lots of convolution layers create a large set of feature extractors
Take in the image, break it down to its constituent parts, use these parts to create class predictions
A thought: an image of a cat is an image of a cat no matter how it was taken
An image of a cat that was not seen in the training data is probably just some function of the different feature extractors learned by the CNN!
A team with millions of dollars has already trained a high performance feature extractor on ImageNet
Broad base
Performs well for a lot of different images
Is it possible for us to use that feature extractor to avoid needing to train the feature extractors (e.g. convolution layers) ourselves?
For image analysis, transfer learning is the norm
Transfer learning works!
The bottom layers learn to extract edges of images
As we move up, we learn to combine edges and innards to get small blocks of images
The process is pretty subject agnostic - an image is just a collection of lines and colors!
No need to keep retraining a method to extract lines and colors in patches from a dense image!
Process:
Take pretrained weights from a model
Pass input images for your task through this deep feature extractor
For each image, record the feature map at the last step
Use this feature map to train a FCNN that will categorize your data!
We’re just using a well-tuned feature extractor!
Let’s look at using a pre-trained ResNet
A few notes about transfer:
This is still very memory intensive. We don’t have to backprop through the convolutional layers, but we do have to compute the feature maps for all of the images! This is costly and requires a lot of resources.
Augmenting your data can help with the training. However, the pre-trained model will have used augmentation, dropout, etc. during training. It will eke out a little more performance!
Transfer learning is the norm!
Images are images are images (just like songs are just pieces of other songs)
SoTA CNNs are learning how to break images into parts. The feature extractor just does this in a really clever way.
It is almost always a waste of time to not use a pre-trained extractor! If your image set is even a little similar to ImageNet (almost all are), you’ll do better using a pre-trained feature extractor.
Transfer learning opens up a world of state of the art image analysis techniques
Image segmentation
Object detection
Semantic segmentation
We’ll talk about these in our next class!